RNA interaction format: a general data format for RNA interactions

Abstract

Summary: RNA molecules play crucial roles in various biological processes. They mediate their function mainly by interacting with other RNAs or proteins. At present, information about these interactions is distributed over different resources, which often provide the data in simple tab-delimited formats that differ between the databases. There is no standardized data format that can capture the nature of all these different interactions in detail.

Availability and implementation: Here, we propose the RNA interaction format (RIF) for the detailed representation of RNA–RNA and RNA–Protein interactions and provide reference implementations in C/C++, Python, and JavaScript. RIF is released under the GNU General Public License version 3 (GNU GPLv3) and is available at https://github.com/RNABioInfo/rna-interaction-format.

We examined the formats of available RNA-RNA/Protein interaction databases, including RNAinter v4.0 (Kang et al., 2022), miRTarBase v9.0 (Huang et al., 2022), sInterBase (Cohen et al., 2023), snoDB v2.0 (Bergeron et al., 2023), and STRING (Mering et al., 2003). Note that we focused on the exported data; the online resources are often more comprehensive. As summarised in Table S1, tab-separated values (TSV) or comma-separated values (CSV) are used most commonly. In the STRING database, the interaction data can be exported as whitespace-separated text files with different granularities, which are accompanied by accessory data. In addition, the whole database can be exported as a SQL dump. Only RNAinter provides an application programming interface (API), but at the time of writing this manuscript it could not be reached. The records can be queried using different fields.

Table S2 lists the information stored in each database. In all databases, an entry is referenced by a specific identifier, which is either an integer (snoDB, sInterBase) or a string consisting of letters and numbers (miRTarBase, RNAinter). The symbols/names of the interaction partners are given in all databases, accompanied by accession identifiers (IDs) when available. For example, sInterBase provides the EcoCyc (Keseler et al., 2005) accession IDs for the query sRNAs and target mRNAs. More comprehensively, snoDB provides accession IDs for HGNC (Seal et al., 2023), RNAcentral (The RNAcentral Consortium, 2019), Ensembl (Martin et al., 2023), RefSeq (O'Leary et al., 2016), Rfam (Kalvari et al., 2021), snoRNABase (Lestrade and Weber, 2006), snoRNA atlas, and snOPY (Yoshihama et al., 2013). In contrast, RNAinter and miRTarBase do not provide any additional IDs in the exported format, but the online entries link to public databases. Other information about the interaction site, such as the genomic coordinates, is only provided by snoDB. In that regard, the biotype is not listed in sInterBase and miRTarBase, as these databases mainly contain mRNA targets for sRNAs and miRNAs, respectively.

Moreover, the exported formats contain information about the supporting evidence. In sInterBase, this only consists of the source of the interaction, which can be a study or another database (e.g., sRNATarBase). Similarly, miRTarBase lists the methodology that supports the interaction (e.g., western blot, qRT-PCR) and links to the study in PubMed if available. In addition, the supporting evidence is classified as 'weak support' when the evidence is not substantial. Similarly, RNAinter provides a confidence score that distinguishes between weak (e.g., ChIP-seq and CLIP-seq) and strong (e.g., RNA immunoprecipitation and luciferase reporter assay) evidence. In addition, the prediction method that supports the interaction is also listed. In contrast, snoDB exports no concrete evidence that supports the interaction.
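To illustrate why differing TSV/CSV exports complicate joint analyses, the following minimal Python sketch maps two delimited exports onto a common set of keys. The database labels (`db_tab`, `db_csv`), the column headers, and the example rows are hypothetical placeholders; they are not the actual export headers of any database discussed here.

```python
import csv
import io

# Hypothetical column mappings: real export headers differ per database
# and are NOT taken from the databases discussed in the text.
COLUMN_MAPS = {
    "db_tab": {"id": "entry_id", "partner1": "rna", "partner2": "target", "evidence": "method"},
    "db_csv": {"id": "ID", "partner1": "query", "partner2": "target_gene", "evidence": "source"},
}
DELIMITERS = {"db_tab": "\t", "db_csv": ","}


def read_export(text, database):
    """Parse a delimited export and map its columns onto common keys."""
    mapping = COLUMN_MAPS[database]
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITERS[database])
    records = []
    for row in reader:
        # Missing columns become empty strings instead of raising an error.
        record = {key: row.get(column, "") for key, column in mapping.items()}
        record["database"] = database
        records.append(record)
    return records


# Made-up example exports, one tab-separated and one comma-separated.
tab_export = "entry_id\trna\ttarget\tmethod\nX0001\tmiR-a\tGENE1\tqRT-PCR\n"
csv_export = "ID,query,target_gene,source\n17,sRNA-b,geneB,study\n"

merged = read_export(tab_export, "db_tab") + read_export(csv_export, "db_csv")
for record in merged:
    print(record)
```

Every additional database requires its own mapping and delimiter entry in such an approach, which is the kind of per-resource effort a shared format would avoid.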

Data Formats
We examined the available file formats for RNA-RNA/Protein interactions to store interaction data from miRTarBase. In particular, we exported the data associated with the miRTarBase ID MIRT000021 (Table S3). In addition, the online entry contains additional information on the interaction sites, genomic coordinates, and sequence, which was also used. In the following, we describe this interaction using different data formats. Note that the sequence/structure of TP53INP1 has been truncated. In the following, the interaction MIRT000021 is

In SBML, the closing tags mainly contribute to this and generate overhead when parsing large interaction networks. In addition, the absence of arrays/lists in SBML requires additional attributes to store similar data points (e.g., evidence1, evidence2). However, an array package is currently under development and may be available in future versions. Attributes are not restricted and can be defined at any level, which inflates the format and prevents uniform use.
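The two points about closing tags and missing arrays can be illustrated with a toy record. The sketch below does not reproduce the SBML or RIF schema: the element names, field names, and values are hypothetical, and JSON is used only as an example of a serialization that supports lists.

```python
import json
import re
import xml.etree.ElementTree as ET

# Toy record; all names and values are hypothetical placeholders.
evidence = ["method A", "method B", "method C"]

# XML-style record: every element with content needs a closing tag, and
# without arrays each data point gets its own numbered attribute.
interaction = ET.Element("interaction", {"id": "example"})
for name in ("rna1", "target1"):
    partner = ET.SubElement(interaction, "partner")
    partner.text = name
support = ET.SubElement(interaction, "support")
for index, method in enumerate(evidence, start=1):
    support.set(f"evidence{index}", method)
xml_text = ET.tostring(interaction, encoding="unicode")

# List-based record: all data points are stored under a single key.
json_text = json.dumps(
    {"id": "example", "partners": ["rna1", "target1"], "evidence": evidence},
    separators=(",", ":"),
)

# Count the characters spent on closing tags in the XML-style record.
closing_tags = re.findall(r"</[^>]+>", xml_text)
closing_chars = sum(len(tag) for tag in closing_tags)

print(xml_text)
print(json_text)
print(len(xml_text), len(json_text), closing_chars)
```

In the XML-style record, a reader has to discover the `evidence1`, `evidence2`, ... attributes by name, whereas the list-based record exposes them as one sequence of arbitrary length.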

Table S1. Overview of the different databases examined in this study.

Table S2. Features in the export format of the considered databases.

Table S3. Exported data of MIRT000021 from miRTarBase described using different data formats. SBML requires 3607 characters to depict MIRT000021. In contrast, RIF requires only 2273 characters, thereby decreasing the file size by 37%.
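The reported reduction follows directly from the two character counts given in the caption of Table S3. A short Python check, using only those two numbers (the variable names are illustrative):

```python
# Character counts reported for MIRT000021 in the caption of Table S3.
sbml_chars = 3607
rif_chars = 2273

# Relative reduction in size when moving from SBML to RIF.
reduction = (sbml_chars - rif_chars) / sbml_chars
print(f"{reduction:.1%}")  # prints "37.0%"
```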